I am excited about this project because I expect to learn new things about R.
My GitHub repository is at https://github.com/pyrysipila/IODS-project
Describe the work you have done this week and summarize your learning.
I have worked on data wrangling and linear regression. I think I have learned a lot. Please see below.
learning2014 <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/learning2014.csv")
The data were collected in 2014-2015 from students on an introductory statistics course at the University of Helsinki. The aim was to assess how attitude and different learning approaches predict success in the course.
More information on the data can be found here: http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
Below I will explore the structure and dimensions of the data.
#dimensions of the data:
dim(learning2014)
## [1] 166 7
#the data have 166 observations and 7 variables
#structure of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
#First six rows of the data:
head(learning2014)
## gender Age attitude deep stra surf Points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
Next, I will describe the data with plots and summaries of the variables.
# A scatter plot matrix
pairs(learning2014[-1], col = learning2014$gender)
library(ggplot2)
library(GGally)
p1 <- ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
p1
summary(learning2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
From the plots and summaries above, it can be seen that the data consist of 110 women and 56 men. The mean age is 25.5 years. Age is unevenly distributed: some people are much older than average, but no one is much younger than average, i.e. the distribution is right-skewed. The other variables have distributions that at least resemble a normal distribution.
Attitude has a moderate positive correlation with points (r ≈ 0.4): a better attitude is associated with better success in the course. Deep (deep approach) has a moderate negative correlation (r ≈ -0.3) with surf (surface approach). This adds to the validity of these constructs, because one would expect the deep and surface approaches to be antagonistic. Correlations between the other variables are low (|r| < 0.2).
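These correlations can be checked directly from the data (a quick sketch, assuming the `learning2014` data frame loaded above):

```r
# correlation of attitude with exam points (moderate, positive)
cor(learning2014$attitude, learning2014$Points)
# correlation of deep with surface approach (moderate, negative)
cor(learning2014$deep, learning2014$surf)
# or the full matrix of all numeric variables at once
round(cor(learning2014[-1]), 2)
```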
Next, I will fit a linear regression model with points as the dependent variable.
lm1 <- lm(Points ~ attitude + deep + stra, data = learning2014)
summary(lm1)
##
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The intercept is statistically significant at the alpha level of 0.05 (p < 0.05). Attitude is a statistically significant predictor of points, but deep and stra are not, so I will drop deep and stra from the model.
lm2 <- lm(Points ~ attitude, data = learning2014)
summary(lm2)
##
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
For every one-point increase in attitude, a student gains on average ~3.5 exam points. R-squared is ~0.19, which means the model explains 19% of the variation in Points; because attitude is the only variable in the model, it alone accounts for that 19%.
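As a sanity check on this interpretation, the fitted line can be applied by hand using the coefficients printed above:

```r
# coefficients taken from summary(lm2) above
intercept <- 11.6372
slope     <- 3.5255
# predicted points for attitude values 3 and 4
intercept + slope * 3   # ~22.2
intercept + slope * 4   # ~25.7
# the difference between them is exactly the slope: ~3.5 points
(intercept + slope * 4) - (intercept + slope * 3)
```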
Next, I will draw the diagnostic plots of the linear model I have just constructed (lm2).
plot(lm2, which = c(1,2,5))
The residuals vs fitted plot does not show any obvious pattern, which indicates that the model fits the data reasonably well and that the errors are homoscedastic.
In the normal Q-Q plot the points lie reasonably close to the diagonal line, so the distribution of the residuals is reasonably close to normal and the linear-regression assumption of normally distributed errors is not violated.
The residuals vs leverage plot does not identify any points with a high Cook's distance (no points fall beyond the dashed red line indicating a Cook's distance of 0.5). This means there are no influential outliers with the potential to distort the regression model.
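The same conclusion can be checked numerically instead of being read off the plot (a sketch, assuming the `lm2` model fitted above):

```r
# Cook's distance for every observation
cd <- cooks.distance(lm2)
# the largest distance, and any observations over the 0.5 rule of thumb
max(cd)
which(cd > 0.5)   # expected: none
```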
alc <- read.csv("//atkk/home/p/pyrysipi/Documents/Tutkimus/Kurssit/Introduction to Open Data Science 2018/Data/alc.csv")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
The data were collected from secondary-school students in Portugal. They contain information on alcohol use and performance in mathematics and Portuguese, as well as a range of background variables.
More information on the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
I will examine the relationships of sex, Medu (Mother’s education), Fedu (Father’s education) and famsize (family size) with high alcohol use.
My hypotheses are that: 1) men have a higher probability of high alcohol use than women; 2) higher maternal education is associated with a lower probability of high alcohol use; 3) higher paternal education is associated with a lower probability of high alcohol use; and 4) larger family size is associated with a lower probability of high alcohol use.
table(sex = alc$sex, high_use = alc$high_use) %>% addmargins()
## high_use
## sex FALSE TRUE Sum
## F 156 42 198
## M 112 72 184
## Sum 268 114 382
table(Medu = alc$Medu, high_use = alc$high_use) %>% addmargins()
## high_use
## Medu FALSE TRUE Sum
## 0 1 2 3
## 1 33 18 51
## 2 80 18 98
## 3 59 36 95
## 4 95 40 135
## Sum 268 114 382
table(Fedu = alc$Fedu, high_use = alc$high_use) %>% addmargins()
## high_use
## Fedu FALSE TRUE Sum
## 0 2 0 2
## 1 53 24 77
## 2 75 30 105
## 3 72 27 99
## 4 66 33 99
## Sum 268 114 382
table(Family_size = alc$famsize, high_use = alc$high_use) %>% addmargins()
## high_use
## Family_size FALSE TRUE Sum
## GT3 201 77 278
## LE3 67 37 104
## Sum 268 114 382
library(ggplot2)
g_sex <- ggplot(alc, aes(high_use, ..count..))
g_sex + geom_bar(aes(fill=sex), position = "dodge")
g_Medu <- ggplot(alc, aes(x = high_use, y = Medu))
g_Medu + geom_boxplot()
g_Fedu <- ggplot(alc, aes(x = high_use, y = Fedu))
g_Fedu + geom_boxplot()
g_famsize <- ggplot(alc, aes(x = high_use, ..count..))
g_famsize + geom_bar(aes(fill=famsize), position = "dodge")
Men seem to be high users of alcohol more often than women, and high use seems to be more common in small families (three or fewer members) than in large families (more than three members). Thus, my hypotheses 1 and 4 seem to be correct.
In contrast, maternal education does not seem to be associated with high alcohol use, and paternal education, if anything, seems to be associated with a higher probability of high alcohol use. Thus, my hypotheses 2 and 3 seem to be false.
m <- glm(high_use ~ sex + Medu + Fedu + famsize, data=alc, family = "binomial")
m
##
## Call: glm(formula = high_use ~ sex + Medu + Fedu + famsize, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) sexM Medu Fedu famsizeLE3
## -1.40340 0.85627 -0.06956 0.08132 0.29581
##
## Degrees of Freedom: 381 Total (i.e. Null); 377 Residual
## Null Deviance: 465.7
## Residual Deviance: 449.2 AIC: 459.2
Male sex, higher father's education (per one-unit increase in education) and small family size (three or fewer members) have positive coefficients, indicating positive associations with high alcohol use.
Higher mother's education (per one-unit increase in education) has a negative coefficient, indicating a negative association with high alcohol use.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2457589 0.1194573 0.4895376
## sexM 2.3543727 1.4978628 3.7375354
## Medu 0.9328005 0.7093940 1.2255525
## Fedu 1.0847204 0.8293229 1.4235393
## famsizeLE3 1.3442147 0.8179982 2.1900766
Men have 2.4 times higher odds of being high alcohol users than women. The association is statistically significant (at the alpha level 0.05) because the 95% confidence interval does not include 1. My hypothesis 1 seems to be correct.
Each one-unit increase in mother's education multiplies the odds of high alcohol use by 0.93. However, the association is not statistically significant (at the alpha level 0.05) because the 95% confidence interval includes 1, so my hypothesis 2 cannot be confirmed.
Each one-unit increase in father's education multiplies the odds of high alcohol use by 1.08. However, this association is not statistically significant either (the 95% confidence interval includes 1), so my hypothesis 3 cannot be confirmed.
Those from small families have 1.3 times higher odds of being high alcohol users than those from large families, but again the association is not statistically significant (the 95% confidence interval includes 1), so my hypothesis 4 cannot be confirmed.
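The odds ratios above are simply exponentiated model coefficients, so they can be recovered by hand from the glm output printed earlier; for example, for `sexM`:

```r
# coefficient of sexM from the glm output above
exp(0.85627)   # ~2.35, matching the OR table
# a CI for the OR excludes 1 exactly when the CI for the
# coefficient excludes 0 (log is monotone):
log(c(1.4978628, 3.7375354))   # both endpoints positive
```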
Only sex will be included in the predictive model because it was the only variable with a statistically significant association with high alcohol use.
# fit the model
m <- glm(high_use ~ sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, high_use, probability, prediction) %>% tail(10)
## sex high_use probability prediction
## 373 M FALSE 0.3913043 FALSE
## 374 M TRUE 0.3913043 FALSE
## 375 F FALSE 0.2121212 FALSE
## 376 F FALSE 0.2121212 FALSE
## 377 F FALSE 0.2121212 FALSE
## 378 F FALSE 0.2121212 FALSE
## 379 F FALSE 0.2121212 FALSE
## 380 F FALSE 0.2121212 FALSE
## 381 M TRUE 0.3913043 FALSE
## 382 M TRUE 0.3913043 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE
## FALSE 268
## TRUE 114
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE
## FALSE 0.7015707
## TRUE 0.2984293
#plot the predictions and actual values
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293
The model predicts that no one is a high alcohol user, which is wrong in 30% of the cases. It still performs better than flipping a coin, a strategy that would be wrong in 50% of the cases.
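Because the model predicts FALSE for everyone, its training error is simply the share of high users in the data (114 of 382, from the tables above):

```r
# share of high users = error rate of the all-FALSE prediction
114 / 382   # ~0.2984, the loss_func value above
```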
Next, I will perform 10-fold cross-validation.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2984293
The model has a prediction error of 0.30. I'll try to build a better one by including the same variables as in the DataCamp exercise plus a couple of extra variables:
# fit the model
m <- glm(high_use ~ school + sex + age + address + famsize + failures + absences, data = alc, family = "binomial")
m
##
## Call: glm(formula = high_use ~ school + sex + age + address + famsize +
## failures + absences, family = "binomial", data = alc)
##
## Coefficients:
## (Intercept) schoolMS sexM age addressU
## -3.18210 0.19912 0.92655 0.09062 -0.40680
## famsizeLE3 failures absences
## 0.30259 0.42108 0.09196
##
## Degrees of Freedom: 381 Total (i.e. Null); 374 Residual
## Null Deviance: 465.7
## Residual Deviance: 419 AIC: 435
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
summary(alc$probability)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09712 0.17396 0.26337 0.29843 0.38262 0.93004
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 79 35
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.65968586 0.04188482
## TRUE 0.20680628 0.09162304
#plot the predictions and actual values
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486911
# perform 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2643979
Now the prediction error is 0.26. It seems that I cannot beat the DataCamp model.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Boston is a classic dataset describing living conditions such as urban environment, location, crime rate, air quality and economic issues in towns in the suburbs of Boston. It has 506 observations and 14 variables. More information on the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.2.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
#histogram
Boston %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#scatter plot of all variables in the dataset
plot(Boston)
# calculate the correlation matrix,round it and print it
cor_matrix<-cor(Boston)
cor_matrix <- cor_matrix %>% round(digits=2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos ="d",tl.cex=0.6)
Many variables have skewed distributions; rm (average number of rooms per dwelling) is the one whose distribution is closest to normal.
There are quite a few strong correlations. The strongest positive correlation is between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways) (0.91) and the strongest negative correlation is between dis (weighted mean of distances to five Boston employment centres) and nox (nitrogen oxides concentration (parts per 10 million)) (-0.77).
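The strongest pairs can also be picked out programmatically instead of scanning the matrix by eye (a sketch on the raw Boston data):

```r
library(MASS)
cm <- cor(Boston)
diag(cm) <- 0                          # ignore the trivial self-correlations
which(cm == max(cm), arr.ind = TRUE)   # rad & tax
max(cm)                                # ~0.91
which(cm == min(cm), arr.ind = TRUE)   # nox & dis
min(cm)                                # ~-0.77
```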
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
All the scaled variables have mean zero (and standard deviation 1).
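`scale()` computes (x - mean(x)) / sd(x) column by column, which is easy to verify on a fresh copy of the data:

```r
library(MASS)
z <- scale(Boston)
# every column mean should be (numerically) zero...
all(abs(colMeans(z)) < 1e-12)
# ...and every column standard deviation should be one
all(abs(apply(z, 2, sd) - 1) < 1e-12)
```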
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2425743 0.2500000 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 1.0966152 -0.9293940 -0.15302300 -0.9085522 0.39563762
## med_low -0.1201594 -0.3670518 -0.07145661 -0.5953865 -0.13481912
## med_high -0.3836558 0.1629760 0.23442641 0.4041304 0.06831498
## high -0.4872402 1.0170298 -0.01233188 1.0255059 -0.41607572
## age dis rad tax ptratio
## low -0.9654355 1.0074428 -0.6953345 -0.7215075 -0.43483412
## med_low -0.3509550 0.3719930 -0.5588134 -0.5432244 -0.06665715
## med_high 0.4248400 -0.3696329 -0.4246941 -0.3088871 -0.24517066
## high 0.7828194 -0.8404207 1.6390172 1.5146914 0.78181164
## black lstat medv
## low 0.37054878 -0.75571378 0.47561323
## med_low 0.34598532 -0.14093571 0.01107822
## med_high 0.08102381 0.03345678 0.12841597
## high -0.71921086 0.83753641 -0.63614925
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09443167 0.73684375 -0.95754464
## indus 0.09199207 -0.14025300 0.27516049
## chas -0.09335274 -0.09623101 0.01975018
## nox 0.31020209 -0.52735211 -1.42442757
## rm -0.14357802 -0.10349530 -0.18943586
## age 0.21200407 -0.42878099 -0.13105976
## dis -0.04508493 -0.14552812 0.09491612
## rad 3.32313031 1.19234534 0.34723668
## tax 0.11352496 -0.34230065 0.15762337
## ptratio 0.11267842 0.03697259 -0.33956654
## black -0.11622480 0.03142054 0.18077624
## lstat 0.22936706 -0.19385382 0.36223286
## medv 0.23634775 -0.29574077 -0.16814036
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9486 0.0392 0.0122
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 15 2 0
## med_low 8 11 9 0
## med_high 0 5 18 2
## high 0 0 0 21
From the confusion matrix above, the proportion of correct predictions is 11/28 (39%) for the low class, 11/28 (39%) for med_low, 18/25 (72%) for med_high and 21/21 (100%) for high. Overall, 61/102 (~60%) of the test observations are classified correctly, and the predictions get better the higher the crime rate. (Note that the train/test split is random and no seed was set before it, so these numbers vary between knits.)
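The overall accuracy can be recomputed from the printed confusion matrix as the diagonal over the total:

```r
# confusion matrix from the cross-tabulation above (rows = correct class)
tab <- matrix(c(11, 15,  2,  0,
                 8, 11,  9,  0,
                 0,  5, 18,  2,
                 0,  0,  0, 21),
              nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)   # 61/102 ≈ 0.60
```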
#Reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)
set.seed(123)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <-kmeans(boston_scaled, centers = 3)
#visualize it
pairs(Boston, col = km$cluster)
##investigate the optimal number of clusters
#set seed for reproducibility
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
#2 seems to be an optimal number of clusters, because the total within sum of squares drops dramatically when moving from one cluster to two clusters.
#run the algorithm again
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Two is the optimal number of clusters. Two clusters seem to differentiate well between towns of low and high crime rate. Other variables do not seem to be able to clearly differentiate between the clusters.
#Reload and scale the Boston data
data("Boston")
boston_scaled <- scale(Boston)
#change the data to a data frame
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <-kmeans(boston_scaled, centers = 3)
#add the clusters to the data
boston_scaled <- data.frame(boston_scaled, km$cluster)
glimpse(boston_scaled)
## Observations: 506
## Variables: 15
## $ crim <dbl> -0.4193669, -0.4169267, -0.4169290, -0.4163384, -0....
## $ zn <dbl> 0.28454827, -0.48724019, -0.48724019, -0.48724019, ...
## $ indus <dbl> -1.2866362, -0.5927944, -0.5927944, -1.3055857, -1....
## $ chas <dbl> -0.2723291, -0.2723291, -0.2723291, -0.2723291, -0....
## $ nox <dbl> -0.1440749, -0.7395304, -0.7395304, -0.8344581, -0....
## $ rm <dbl> 0.4132629, 0.1940824, 1.2814456, 1.0152978, 1.22736...
## $ age <dbl> -0.11989477, 0.36680343, -0.26554897, -0.80908783, ...
## $ dis <dbl> 0.1400749840, 0.5566090496, 0.5566090496, 1.0766711...
## $ rad <dbl> -0.9818712, -0.8670245, -0.8670245, -0.7521778, -0....
## $ tax <dbl> -0.6659492, -0.9863534, -0.9863534, -1.1050216, -1....
## $ ptratio <dbl> -1.4575580, -0.3027945, -0.3027945, 0.1129203, 0.11...
## $ black <dbl> 0.4406159, 0.4406159, 0.3960351, 0.4157514, 0.44061...
## $ lstat <dbl> -1.07449897, -0.49195252, -1.20753241, -1.36017078,...
## $ medv <dbl> 0.15952779, -0.10142392, 1.32293748, 1.18158864, 1....
## $ km.cluster <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
#perform LDA using the clusters as target classes
lda.fit <- lda(km.cluster ~ ., data = boston_scaled)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit)
rad (index of accessibility to radial highways) is the most influential linear separator for the clusters. Other influential separators are tax (full-value property-tax rate per $10,000), age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres).
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
train2 <- train
train2$num_crime <- as.numeric(train2$crime)
train2$num_crime <- scale(train2$num_crime)
train2$crime <- train2$num_crime
train2 <- select(train2,-num_crime)
km <- kmeans(train2, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
There is an obvious difference: crime has four classes but k-means produced only two clusters. Apart from that, the plots look surprisingly similar. The clusters formed by k-means seem mostly to differentiate towns with a high crime rate from towns with lower crime rates.